Evolving NoSQL Databases Without Downtime
NoSQL databases like Redis, Cassandra, and MongoDB are increasingly popular
because they are flexible, lightweight, and easy to work with. Applications
that use these databases will evolve over time, sometimes necessitating (or
preferring) a change to the format or organization of the data. The problem we
address in this paper is: How can we support the evolution of high-availability
applications and their NoSQL data online, without excessive delays or
interruptions, even in the presence of backward-incompatible data format
changes?
We present KVolve, an extension to the popular Redis NoSQL database, as a
solution to this problem. KVolve permits a developer to submit an upgrade
specification that defines how to transform existing data to the newest
version. This transformation is applied lazily as applications interact with
the database, thus avoiding long pause times. We demonstrate that KVolve is
expressive enough to support substantial practical updates, including format
changes to RedisFS, a Redis-backed file system, while imposing essentially no
overhead in general use and minimal pause times during updates. Comment: Update to writing/structur
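The lazy transformation described in this abstract can be illustrated with a minimal sketch. It is not KVolve's implementation or API: the class `LazyVersionedStore` and its methods are hypothetical, and a plain in-memory dictionary stands in for Redis. The sketch only shows the general idea that registering an upgrade touches no data, and that each value is converted the first time it is read.

```python
from typing import Callable, Dict, Tuple


class LazyVersionedStore:
    """Toy in-memory key-value store that upgrades values lazily on access.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self) -> None:
        # key -> (version the value was written at, value)
        self._data: Dict[str, Tuple[int, str]] = {}
        # version n -> function converting a value from version n to n + 1
        self._upgrades: Dict[int, Callable[[str], str]] = {}
        self.current_version = 0

    def register_upgrade(self, transform: Callable[[str], str]) -> None:
        """Install a transform to the next version without touching any data."""
        self._upgrades[self.current_version] = transform
        self.current_version += 1

    def set(self, key: str, value: str) -> None:
        """Store a value tagged with the current version."""
        self._data[key] = (self.current_version, value)

    def get(self, key: str) -> str:
        """Return the value, applying any pending transforms first."""
        version, value = self._data[key]
        while version < self.current_version:
            value = self._upgrades[version](value)
            version += 1
        # Persist the upgraded form so the conversion cost is paid only once.
        self._data[key] = (version, value)
        return value

    def stored_version(self, key: str) -> int:
        """Version tag of the value as currently stored (for inspection)."""
        return self._data[key][0]


store = LazyVersionedStore()
store.set("user:1", "alice|30")
store.register_upgrade(lambda v: v.replace("|", ","))  # new field separator
version_before_read = store.stored_version("user:1")   # still the old version
upgraded_value = store.get("user:1")                   # converted on first read
version_after_read = store.stored_version("user:1")
```

Because the conversion cost is spread across individual accesses, no single stop-the-world pass over the data is needed, which is the property the abstract attributes to lazy application.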
Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks
Deep neural networks (DNNs) have been shown to tolerate "brain damage":
cumulative changes to the network's parameters (e.g., pruning, numerical
perturbations) typically result in a graceful degradation of classification
accuracy. However, the limits of this natural resilience are not well
understood in the presence of small adversarial changes to the DNN parameters'
underlying memory representation, such as bit-flips that may be induced by
hardware fault attacks. We study the effects of bitwise corruptions on 19 DNN
models---six architectures on three image classification tasks---and we show
that most models have at least one parameter that, after a specific bit-flip in
their bitwise representation, causes an accuracy loss of over 90%. We employ
simple heuristics to efficiently identify the parameters likely to be
vulnerable. We estimate that 40-50% of the parameters in a model might lead to
an accuracy drop greater than 10% when individually subjected to such
single-bit perturbations. To demonstrate how an adversary could take advantage
of this vulnerability, we study the impact of an exemplary hardware fault
attack, Rowhammer, on DNNs. Specifically, we show that a Rowhammer-enabled
attacker co-located in the same physical machine can inflict significant
accuracy drops (up to 99%) even with single bit-flip corruptions and no
knowledge of the model. Our results expose the limits of DNNs' resilience
against parameter perturbations induced by real-world fault attacks. We
conclude by discussing possible mitigations and future research directions
towards fault attack-resilient DNNs. Comment: Accepted to USENIX Security Symposium (USENIX) 201
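To see why a single bit-flip can matter so much, the following sketch flips one bit in the IEEE-754 single-precision encoding of a number. It assumes parameters are stored as 32-bit floats, which the abstract does not state, and the helper `flip_bit` is a hypothetical name introduced here. It illustrates the arithmetic only, not the paper's heuristics or the Rowhammer attack.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 encoding (bit 0 is the least significant)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


weight = 0.5
low_mantissa_flip = flip_bit(weight, 0)   # changes the value only slightly
sign_flip = flip_bit(weight, 31)          # negates the value
high_exponent_flip = flip_bit(weight, 30) # scales the value to 2**127
```

Flipping the most significant exponent bit turns a small weight into an enormous one, whereas flipping a low mantissa bit is nearly harmless. This asymmetry is one plausible reason why only specific bit positions cause large accuracy losses.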
Improving the Dependability of Distributed Systems through AIR Software Upgrades
Traditional fault-tolerance mechanisms concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, scheduled events, such as software upgrades, account for most of the system unavailability and often introduce data corruption or latent errors. Through two empirical studies, this dissertation identifies the leading causes of upgrade failure (breaking hidden dependencies) and of planned downtime (complex data conversions) in distributed enterprise systems. These findings represent the foundation of a new benchmark for software-upgrade dependability.
This dissertation further introduces the AIR properties (Atomicity, Isolation, and Runtime-testing) required for improving the dependability of distributed systems that undergo major software upgrades. The AIR properties are realized in Imago, a system designed to reduce both planned and unplanned downtime by upgrading distributed systems end-to-end. Imago builds upon the idea of isolating the production system from the upgrade operations, in order to avoid breaking hidden dependencies and to decouple the data conversions from the normal system operation. Imago includes novel mechanisms, such as providing a parallel universe for the new version, performing data conversions opportunistically, intercepting the live workload at the ingress and egress points, or executing an atomic switchover to the new version, which allow it to deliver the AIR properties.
Imago harnesses opportunities provided by the emerging cloud-computing technologies, by trading resource overhead (needed by the parallel universe) for an improved dependability of the software upgrades. This approach separates the functional aspects of the upgrade from the mechanisms for online upgrade, enabling an upgrade-as-a-service model. This dissertation also describes techniques for assessing the impact of software upgrades, in order to reason about the implications of relaxing the AIR guarantees.
Fault-tolerant middleware and the magical 1%
Abstract. Through an extensive experimental analysis of over 900 possible configurations of a fault-tolerant middleware system, we present empirical evidence that the unpredictability inherent in such systems arises from merely 1% of the remote invocations. The occurrence of very high latencies cannot be regulated through parameters such as the number of clients, the replication style and degree or the request rates. However, by selectively filtering out a "magical 1%" of the raw observations of various metrics, we show that performance, in terms of measured end-to-end latency and throughput, can be bounded, easy to understand and control. This simple statistical technique enables us to guarantee, with some level of confidence, bounds for percentile-based quality of service (QoS) metrics, which dramatically increase our ability to tune and control a middleware system in a predictable manner.
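The filtering idea in this abstract can be sketched as follows. The abstract does not say exactly how the 1% is selected, so this sketch assumes the simplest reading: discard the largest 1% of latency samples and bound what remains. The function `trim_top_percent` and the sample data are hypothetical.

```python
from typing import List


def trim_top_percent(samples: List[float], percent: int = 1) -> List[float]:
    """Return the samples in sorted order with the largest `percent`% removed.

    The number of dropped samples is rounded up to a whole count.
    """
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range 0..99")
    # Integer ceiling of len(samples) * percent / 100, avoiding float rounding.
    drop = (len(samples) * percent + 99) // 100
    ordered = sorted(samples)
    return ordered[: len(ordered) - drop]


# Hypothetical latencies in milliseconds: 99 ordinary requests and one outlier.
latencies = [1.0] * 99 + [500.0]
raw_max = max(latencies)
trimmed = trim_top_percent(latencies)
trimmed_max = max(trimmed)
```

In this toy data the raw maximum is dominated by a single outlier, while the maximum of the remaining 99% is tight. That is the sense in which a percentile-based bound can be predictable even when the worst case is not.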
No Downtime for Data Conversions: Rethinking Hot Upgrades (CMU-PDL-09-106)
Unavailability in enterprise systems is usually the result of planned events, such as upgrades, rather than failures. Major system upgrades entail complex data conversions that are difficult to perform on the fly, in the face of live workloads. Minimizing the downtime imposed by such conversions is a time-intensive and error-prone manual process. We present Imago, a system that aims to simplify the upgrade process, and we show that it can eliminate all the causes of planned downtime recorded during the upgrade history of one of the ten most popular websites. Building on the lessons learned from past research on live upgrades in middleware systems, Imago trades off a need for additional storage resources for the ability to perform end-to-end enterprise upgrades online, with minimal application-specific knowledge.